High-Performance Computing: An overview
Parallel computing in R
Extended examples
Loosely, from R’s perspective, we can think of HPC in terms of two, maybe three things:
Big data: How to work with data that doesn't fit in your computer's memory
Parallel computing: How to take advantage of multi-core systems
Compiled code: Write your own low-level code (if R doesn't have it yet…)
(Check out the CRAN Task View on HPC)
Buy a bigger computer / more RAM (not the best solution!)
Use out-of-memory storage, i.e., don't load all your data into RAM, e.g., the bigmemory, data.table, and HadoopStreaming R packages
Store it more efficiently, e.g., as sparse matrices (take a look at the dgCMatrix objects from the Matrix R package); a quick sketch of this follows
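To illustrate the last point, here is a minimal sketch (using the Matrix package mentioned above) of how much memory a sparse representation can save:

library(Matrix)

# A 1,000 x 1,000 matrix with only 100 non-zero entries
set.seed(1)
x_dense <- matrix(0, nrow = 1000, ncol = 1000)
x_dense[sample(length(x_dense), 100)] <- runif(100)

# The same data stored as a sparse dgCMatrix
x_sparse <- Matrix(x_dense, sparse = TRUE)
class(x_sparse)       # "dgCMatrix"
object.size(x_dense)  # ~8 MB: every cell is stored
object.size(x_sparse) # a few KB: only the non-zero entries are stored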
Flynn’s Classical Taxonomy (Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National Laboratory)
We will be focusing on the Single Instruction stream, Multiple Data stream (SIMD) model
In rough terms:
Supercomputer: A single big machine with thousands of cores/GPUs.
High Performance Computing (HPC): Multiple machines within a single network.
High Throughput Computing (HTC): Multiple machines across multiple networks.
You may not have access to a supercomputer, but HPC/HTC clusters are certainly more accessible these days; e.g., AWS provides a service to create HPC clusters at a low cost (allegedly, since nobody understands how pricing works)
Ask yourself these questions before jumping into HPC!
While there are several alternatives (just take a look at the High-Performance Computing Task View), we’ll focus on the following R-packages for explicit parallelism:
parallel: R package that provides ‘[s]upport for parallel computation, including random-number generation’.
foreach: R package for ‘general iteration over elements’ in parallel fashion.
future: ‘[A] lightweight and unified Future API for sequential and parallel processing of R expression via futures.’ (won’t cover here)
Implicit parallelism, on the other hand, refers to out-of-the-box tools that allow the programmer not to worry about parallelization, e.g., gpuR for matrix manipulation using the GPU, or tensorflow
And there’s also a more advanced set of options
(Usually) We do the following:
Create a PSOCK/FORK (or other) cluster using makePSOCKcluster/makeForkCluster (or makeCluster)
Copy/prepare each R session (if you are using a PSOCK cluster):
Copy objects with clusterExport
Pass expressions with clusterEvalQ
Set a seed
Do your call: parApply, parLapply, etc.
Stop the cluster with stopCluster (see the sketch below)
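Putting those steps together, a minimal sketch of the workflow could look like this (the object x and the function f are just placeholders):

library(parallel)

cl <- makePSOCKcluster(4)         # 1. Create the cluster

x <- 10
f <- function(i) sqrt(i) + x
clusterExport(cl, c("x", "f"))    # 2a. Copy objects to each session
clusterEvalQ(cl, library(stats))  # 2b. Evaluate an expression on each session
clusterSetRNGStream(cl, 123)      # 2c. Set a (parallel-safe) seed

ans <- parSapply(cl, 1:8, f)      # 3. Do your call

stopCluster(cl)                   # 4. Stop the cluster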
PSOCK Cluster makePSOCKcluster:
Can be created with makePSOCKcluster
Creates brand new R sessions (so nothing is inherited from the master), e.g.
# This creates a cluster with 4 R sessions
cl <- makePSOCKcluster(4)
Child sessions are connected to the master session via socket connections
Can be created outside of the current computer, i.e., across multiple computers!
Fork Cluster makeForkCluster:
Uses OS forking,
Copies the current R session locally (so everything is inherited from the master up to that point).
Data is only duplicated if it is altered (copy-on-write).
Not available on Windows.
makeCluster: passed to snow (Simple Network of Workstations)

makePSOCKcluster

# 1. CREATING A CLUSTER
library(parallel)
nnodes <- 4L
cl <- makePSOCKcluster(nnodes)
# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123) # The parallel-safe analog of `set.seed(123)`
# 3. DO YOUR CALL
ans <- parSapply(cl, 1:nnodes, function(x) runif(1e3))
(ans0 <- var(ans))

##               [,1]          [,2]          [,3]          [,4]
## [1,] 0.0861888293 -0.0001633431 5.939143e-04 -3.672845e-04
## [2,] -0.0001633431 0.0853841838 2.390790e-03 -1.462154e-04
## [3,] 0.0005939143 0.0023907904 8.114219e-02 -4.714618e-06
## [4,] -0.0003672845 -0.0001462154 -4.714618e-06 8.467722e-02
Making sure it is reproducible
# I want to get the same!
clusterSetRNGStream(cl, 123)
ans1 <- var(parSapply(cl, 1:nnodes, function(x) runif(1e3)))
# 4. STOP THE CLUSTER
stopCluster(cl)
all.equal(ans0, ans1) # All equal!

## [1] TRUE
makeForkCluster

In the case of makeForkCluster:
# 1. CREATING A CLUSTER
library(parallel)
# The fork cluster will copy the -nsims- object
nsims <- 1e3
nnodes <- 4L
cl <- makeForkCluster(nnodes)
# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123)
# 3. DO YOUR CALL
ans <- do.call(cbind, parLapply(cl, 1:nnodes, function(x) {
runif(nsims) # Look! We use the -nsims- object!
# This would have failed with makePSOCKcluster
# if we hadn't copied -nsims- first.
}))
(ans0 <- var(ans))

##               [,1]          [,2]          [,3]          [,4]
## [1,] 0.0861888293 -0.0001633431 5.939143e-04 -3.672845e-04
## [2,] -0.0001633431 0.0853841838 2.390790e-03 -1.462154e-04
## [3,] 0.0005939143 0.0023907904 8.114219e-02 -4.714618e-06
## [4,] -0.0003672845 -0.0001462154 -4.714618e-06 8.467722e-02
Again, we want to make sure this is reproducible
# Same sequence with same seed
clusterSetRNGStream(cl, 123)
ans1 <- var(do.call(cbind, parLapply(cl, 1:nnodes, function(x) runif(nsims))))
ans0 - ans1 # A matrix of zeros

##      [,1] [,2] [,3] [,4]
## [1,] 0 0 0 0
## [2,] 0 0 0 0
## [3,] 0 0 0 0
## [4,] 0 0 0 0
# 4. STOP THE CLUSTER
stopCluster(cl)

mclapply (Forking on the fly)

In the case of mclapply, the forking (cluster creation) is done on the fly!
# 1. CREATING A CLUSTER
library(parallel)
# The fork cluster will copy the -nsims- object
nsims <- 1e3
nnodes <- 4L
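# Note: mclapply() uses getOption("mc.cores", 2L) workers unless you pass
# the number explicitly, e.g. mclapply(..., mc.cores = nnodes)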
# cl <- makeForkCluster(nnodes) # mclapply does it on the fly
# 2. PREPARING THE CLUSTER
set.seed(123)
# 3. DO YOUR CALL
ans <- do.call(cbind, mclapply(1:nnodes, function(x) runif(nsims)))
(ans0 <- var(ans))

##             [,1]        [,2]         [,3]         [,4]
## [1,] 0.085384184 0.002390790 0.006576204 -0.003998278
## [2,] 0.002390790 0.081142190 0.001846963 0.001476244
## [3,] 0.006576204 0.001846963 0.085175347 -0.002807348
## [4,] -0.003998278 0.001476244 -0.002807348 0.082425477
Once more, we want to make sure this is reproducible
# Same sequence with same seed
set.seed(123)
ans1 <- var(do.call(cbind, mclapply(1:nnodes, function(x) runif(nsims))))
ans0 - ans1 # A matrix of zeros

##      [,1] [,2] [,3] [,4]
## [1,] 0 0 0 0
## [2,] 0 0 0 0
## [3,] 0 0 0 0
## [4,] 0 0 0 0
# 4. STOP THE CLUSTER
# stopCluster(cl) # no need to do this anymore

Friendlier than RcppParallel… at least for ‘I-use-Rcpp-but-don’t-actually-know-much-about-C++’ users (like myself!).
Must run only ‘Thread-safe’ calls, so calling R within parallel blocks can cause problems (almost all the time).
Use arma objects, e.g. arma::mat, arma::vec, etc., or, if you are more used to them, std::vector objects, as these are thread-safe.
Pseudo-random number generation is not very straightforward… but C++11 has a nice set of functions that can be used together with OpenMP.
Need to think about how processors work, cache memory, etc.; otherwise you could get into trouble… if your code is slower when run in parallel, you are probably facing false sharing.
If R crashes… try running R with a debugger (see Section 4.3 in Writing R extensions):
~$ R --debugger=valgrind

Tell Rcpp that you need OpenMP by including the header and enabling the compiler plugin:
#include <omp.h>
// [[Rcpp::plugins(openmp)]]

Within your function, set the number of threads, e.g.
// Setting the cores
omp_set_num_threads(cores);

Tell the compiler that you’ll be running a block in parallel with OpenMP:
#pragma omp [directives] [options]
{
...your neat parallel code...
}

You’ll need to specify how OpenMP should handle the data:
shared: Default; all threads access the same copy.
private: Each thread has its own copy, uninitialized.
firstprivate: Each thread has its own copy, initialized with the variable’s value before the parallel block.
lastprivate: Each thread has its own copy; the value from the last iteration is copied back to the original variable.

Setting default(none) is a good practice.
Compile!
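From R, Rcpp::sourceCpp() takes care of compiling a C++ file (such as the one below) and exposing its exported functions to the session; the file name here is just an assumption:

library(Rcpp)

# Compiles the C++ source and makes dist_par() available in the R session
sourceCpp("dist_par.cpp")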
Our own version of the dist function… but in parallel!
#include <omp.h>
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]
using namespace Rcpp;
// [[Rcpp::export]]
arma::mat dist_par(const arma::mat & X, int cores = 1) {
// Some constants
int N = (int) X.n_rows;
int K = (int) X.n_cols;
// Output
arma::mat D(N,N);
D.zeros(); // Filling with zeros
// Setting the cores
omp_set_num_threads(cores);
#pragma omp parallel for shared(D, N, K, X) default(none)
for (int i=0; i<N; ++i)
for (int j=0; j<i; ++j) {
for (int k=0; k<K; k++)
D.at(i,j) += pow(X.at(i,k) - X.at(j,k), 2.0);
// Computing square root
D.at(i,j) = sqrt(D.at(i,j));
D.at(j,i) = D.at(i,j);
}
// My nice distance matrix
return D;
}

# Simulating data
set.seed(1231)
K <- 5000
n <- 500
x <- matrix(rnorm(n*K), ncol=K)
# Are we getting the same?
table(as.matrix(dist(x)) - dist_par(x, 4)) # Only zeros

##
## 0
## 250000
# Benchmarking!
rbenchmark::benchmark(
dist(x), # stats::dist
dist_par(x, cores = 1), # 1 core
dist_par(x, cores = 4), # 4 cores
dist_par(x, cores = 8), # 8 cores
replications = 1, order="elapsed"
)[,1:4]

##                     test replications elapsed relative
## 4 dist_par(x, cores = 8) 1 0.636 1.000
## 3 dist_par(x, cores = 4) 1 1.204 1.893
## 2 dist_par(x, cores = 1) 1 2.530 3.978
## 1 dist(x) 1 6.477 10.184
Bonus: a quick look at the future package

library(future)
plan(multicore)
# We are creating a global variable
a <- 2
# Creating the futures has only the overhead (setup) time
system.time({
x1 %<-% {Sys.sleep(3);a^2}
x2 %<-% {Sys.sleep(3);a^3}
})
## user system elapsed
## 0.024 0.008 0.033
# Let's wait 3 seconds to make sure both futures have resolved
# (accessing x1/x2 would otherwise block until their values are ready)
Sys.sleep(3)
system.time({
print(x1)
print(x2)
})
## [1] 4
## [1] 8
## user system elapsed
## 0.004 0.000 0.004
gvegayon
@gvegayon
ggvy.cl
For more, check out the CRAN Task View on HPC
Session info

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] future_1.10.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.17 codetools_0.2-15 listenv_0.7.0 revealjs_0.9
## [5] digest_0.6.15 rprojroot_1.3-2 backports_1.1.2 magrittr_1.5
## [9] evaluate_0.10.1 icon_0.1.0 highr_0.6 stringi_1.2.3
## [13] rmarkdown_1.10 tools_3.5.0 stringr_1.3.0 yaml_2.1.19
## [17] compiler_3.5.0 globals_0.12.4 htmltools_0.3.6 knitr_1.20
We know that \(\pi = \frac{A}{r^2}\), where \(A\) is the area of a circle of radius \(r\). We approximate it by randomly drawing points \(x\) in a square of side 2 centered at the origin.
Since a circle of radius 1 covers a fraction \(\pi/4\) of that square's area, we approximate \(\pi\) as \(\Pr\{\|x\| \leq 1\}\times 2^2\)
The R code to do this
pisim <- function(i, nsim) { # Notice we don't use the -i-
# Random points
ans <- matrix(runif(nsim*2), ncol=2)
# Distance to the origin
ans <- sqrt(rowSums(ans^2))
# Estimated pi
(sum(ans <= 1)*4)/nsim
}

# Setup
cl <- makePSOCKcluster(4L)
clusterSetRNGStream(cl, 123)
# Number of simulations we want each time to run
nsim <- 1e5
# We need to make -nsim- and -pisim- available to the
# cluster
clusterExport(cl, c("nsim", "pisim"))
# Benchmarking: parSapply and sapply will run this simulation
# a hundred times each, so at the end we have 1e5*100 points
# to approximate pi
rbenchmark::benchmark(
parallel = parSapply(cl, 1:100, pisim, nsim=nsim),
serial = sapply(1:100, pisim, nsim=nsim), replications = 1
)[,1:4]

##       test replications elapsed relative
## 1 parallel 1 0.441 1.000
## 2 serial 1 1.154 2.617
## par ser R
## 3.141126 3.141247 3.141593
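The code generating the comparison above is not shown in the slides; a sketch along these lines (reusing pisim, nsim, and the cluster from the benchmark) would produce it:

# A sketch (the original code is not shown): average the 100 replicates
# from each approach and compare against R's built-in constant
clusterSetRNGStream(cl, 123)
c(par = mean(parSapply(cl, 1:100, pisim, nsim = nsim)),
  ser = mean(sapply(1:100, pisim, nsim = nsim)),
  R   = pi)

stopCluster(cl)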
Suppose that we would like to maximize/minimize a function using a stochastic optimization algorithm, namely, the Artificial Bee Colony algorithm
The following R script (01-slurm-abcoptim.R) was designed to work with Slurm (it requires the R package ABCoptim [@ABCoptim])
# Include this to tell where everything will be living at
.libPaths("~/R/x86_64-pc-linux-gnu-library/3.4/")
# Default CRAN mirror from where to download R packages
options(repos =c(CRAN="https://cloud.r-project.org/"))
# You need to have the ABCoptim R package
library(ABCoptim)
fun <- function(x) {
-cos(x[1])*cos(x[2])*exp(-((x[1] - pi)^2 + (x[2] - pi)^2))
}
ans <- abc_optim(rep(0,2), fun, lb=-10, ub=10, criter=50)
saveRDS(
ans,
file = paste0(
"~/hpc-with-r/examples/01-slurm-abcoptim-",
Sys.getenv("SLURM_JOB_ID"), # SLURM ENV VAR
"-",
Sys.getenv("SLURM_ARRAY_TASK_ID"), # SLURM ENV VAR
".rds"
))

Notice that we are using SLURM_JOB_ID and SLURM_ARRAY_TASK_ID (both environment variables created by Slurm) to save our results.

To run the previous R script, we can use the following bash file (01-slurm-abcoptim.sh):
#!/bin/bash
#SBATCH --tasks=1
#SBATCH --array=1-3
#SBATCH --job-name=01-slurm-abcoptim
#SBATCH --output=01-slurm-abcoptim-%A_%a.out
source /usr/usc/R/3.4.0/setup.sh
Rscript --vanilla ~/hpc-with-r/examples/01-slurm-abcoptim.R

Here we are taking advantage of Slurm arrays: we are running the same R script in 3 instances (--array=1-3)
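Each of the three instances could, for example, use the array task ID to control its random seed. A hypothetical addition to the R script (not in the original):

# Hypothetical: give each array task its own seed
task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))
set.seed(task_id)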
To run the job we just need to type
$ sbatch 01-slurm-abcoptim.sh

Make sure you modify the file paths so that they match your files!
Now you try it!